Exploration of white wine quality by Aarthy Vallur

Project description

In this project, I used exploratory data analysis to determine the physical and chemical properties that could affect the quality of white wines. R packages were used for analysis and visualization. The data set is available here https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv and background on the data set and its attributes is available here https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt. First, the necessary packages were loaded and the data were read.

Univariate Plots Section

Summary of dataset

The structure and attributes of the dataset were explored. The heading of column 1 was changed from X to Wine ID to reflect its contents.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
##  [1] "Wine ID"              "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
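The loading and renaming steps can be sketched as below. This is a minimal illustration, not the original code: the name `wines` and the inline `csv_text` are assumptions, and the three toy rows only reuse values shown in the structure output above (in practice the CSV from the URL in the project description would be read instead).

```r
# Toy stand-in for wineQualityWhites.csv: first three rows, four columns,
# with values copied from the str() output above.
csv_text <- "X,fixed.acidity,alcohol,quality
1,7.0,8.8,6
2,6.3,9.5,6
3,8.1,10.1,6"

# 'wines' is a hypothetical name for the data frame.
wines <- read.csv(text = csv_text)

# Inspect structure and descriptive statistics.
str(wines)
summary(wines)

# Rename the first column to reflect its contents.
names(wines)[1] <- "Wine ID"
names(wines)
```

Because the new name contains a space, the column has to be accessed as `wines[["Wine ID"]]` or with backticks.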

Univariate Plots

In this section, each variable was plotted using histograms and boxplots to see how the white wines are distributed. A summary variable was explored first: a histogram of the experts’ ratings, as summarised in the variable quality, was plotted. The other variables followed, starting with the acidity attributes and pH of the white wines and then the other chemical components. Before plotting, descriptive statistics for each variable were obtained to help choose bin widths and axis limits.

Histogram for visualising distribution of white wines by quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

White wines of higher quality than 6 and then those under 5 were counted.

## 
## FALSE  TRUE 
##  3838  1060
## 
## FALSE  TRUE 
##  4715   183

From the above distribution, it is clear that a majority of the white wines fall in the medium quality range of 5-6 (3655). Lower quality wines were fewer (183) and so were higher quality wines (1060). The analysis in this project will focus on what variables contribute to this division and how they converge to create wines that can be categorised by quality.
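The counts above can be reproduced with logical tables. The sketch below does not need the CSV: it rebuilds the quality column from the per-score counts reported in the ratings table further down in this report, and the variable names are illustrative.

```r
# Quality scores rebuilt from the per-score counts reported in this document
# (scores 3 to 9).
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Wines rated above 6, and wines rated below 5.
high <- table(quality > 6)
low <- table(quality < 5)

# Wines in the medium range of 5-6.
medium <- sum(quality %in% 5:6)

high
low
medium
```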

Histograms for visualising distribution of acidity aspects of white wines

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Acidity is an important aspect of white wines which contributes to their taste and finish. pH measures overall acidity and in white wines is typically between 3 and 3.3. Fixed acidity is a function of the grapes used for making the wine, while citric acid contributes to its crisp taste and finish. Volatile acidity comes mainly from acetic acid and generally must be very low in wines meant for consumption. From the above histograms, the 3 variables that represent acidity of white wines seem to have long-tailed distributions, unlike pH. Far outliers were noticed for fixed acidity, volatile acidity and citric acid. For example, even though the mean citric acid content was only 0.3342, the maximum value in the dataset was 1.66, about five times the mean! But the number of white wines showing these extreme variations from central tendency was low.

Histograms for visualising distribution of other variables measuring chemical composition

Next, the other variables measuring the chemical composition of white wines were plotted as histograms. As before, summaries of descriptive statistics were obtained to decide bin widths and axes.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

From the above histograms, all 4 variables also seem to have long-tailed distributions. Far outliers were noticed for all variables, indicating significant variation in content. For example, free sulfur dioxide ranges from 2 to 289. But the closeness of the median and the mean indicates that white wines showing extreme variations from central tendency were rare. On the whole, salt content (chlorides) in the white wines tends to be low, while sulfur dioxides and dissolved sulphates, which indicate preservatives, tend to be around a standard concentration. The few outliers could have compromised taste.

Histograms for visualising distribution of interdependent variables, alcohol, residual sugar and density of wines using the function, get_histogram

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The summaries and plots indicate that three quarters of the white wines have residual sugar levels below 9.9 g/dm^3, but many outliers are seen. The range is narrower for density, while alcohol has a more spread out distribution. Residual sugar indicates the sugar left over after fermentation, while alcohol is made by fermenting the sugar. Both influence density. These will be useful variables to analyse in the bivariate plots section.
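The `stat_bin()` messages above appear when no bin width is supplied. One way to derive a bin width from the summaries is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). The sketch below is an assumption about how this could be done, not the body of get_histogram (which is not shown in this report); `fd_binwidth` is a hypothetical helper, and the quartiles are copied from the summaries above.

```r
# Freedman-Diaconis bin width from the first and third quartiles and the
# number of observations. Hypothetical helper, not part of get_histogram.
fd_binwidth <- function(q1, q3, n) {
  2 * (q3 - q1) / n^(1 / 3)
}

n_wines <- 4898

# Quartiles taken from the summary output in this section.
bw_sugar <- fd_binwidth(1.7, 9.9, n_wines)
bw_alcohol <- fd_binwidth(9.5, 11.4, n_wines)
bw_density <- fd_binwidth(0.9917, 0.9961, n_wines)

c(residual.sugar = bw_sugar, alcohol = bw_alcohol, density = bw_density)

# With ggplot2, a chosen value would be passed explicitly, for example:
#   ggplot(wines, aes(x = alcohol)) + geom_histogram(binwidth = bw_alcohol)
```

Passing `binwidth` explicitly silences the message and keeps the bins comparable between reruns.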

Ratings to bin quality of white wines

To bin white wines by quality, a rating system was introduced that groups quality into 3 categories: Low (Quality = 3 or 4), Average (Quality = 5 or 6) and High (Quality = 7, 8 or 9). A new variable, “Rating”, was created in the data frame.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
##     Low Average    High 
##     183    3655    1060

As expected from the distribution based on quality, most wines can be rated ‘Average’, with very few rated ‘Low’. ‘High’ rated wines made up less than a fourth of the white wines analysed in this dataset.
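The binning described above can be sketched with `cut()`. The quality vector is rebuilt from the per-score counts in the table above; the break points are my reading of the three categories, and the original code may have used a different construct.

```r
# Quality scores rebuilt from the per-score counts shown above.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are right-closed by default: (2,4] = Low, (4,6] = Average,
# (6,9] = High.
rating <- cut(quality,
              breaks = c(2, 4, 6, 9),
              labels = c("Low", "Average", "High"))

table(rating)
```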

Univariate Analysis

Structure and attributes of dataset

My dataset is a data frame with 4898 observations of 13 variables. The variables are Wine ID, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and quality, recorded for 4898 white wines.

Main features of interest

To me, the main feature of interest in the data set is that a sensory measure (quality) for white wines is the outcome of measurable properties. How a complex interplay of physical and chemical properties affects the taster’s perception of quality is the major point of interest.

Other features

The other 11 variables represent a combination of physical (density) and chemical (alcohol, sugars, pH) properties of the white wines. Some of the variables are known to be interdependent, such as fixed acidity, citric acid and pH, as well as density, residual sugar and alcohol. These known associations make it easier to analyze how the main feature, quality of the white wines, is altered by physical and chemical properties.

New variables created

I created a new variable “Rating” to bin the white wines by quality into 3 categories, Low (Quality < 5), Average (Quality = 5 or 6) and High (Quality > 6). Most of the white wines fell in the Average category, with Low and High having only 183 and 1060 wines respectively.

Data wrangling, if any

The white wines data set is tidy and did not require any wrangling. The observations so far indicate that a clear correlation between a variable and quality may not be observed. Quality will more likely be a function of a combination of variables as will be clear in the bi and multivariate analysis sections.

Bivariate Plots Section

Scatterplot matrix of variables

I first ran a series of scatterplot matrices using ggpairs to get an overall idea of bivariate plots and correlations between the variables. I was particularly interested in variables that affect the quality of white wines and other variables that in turn affect them. I have used only 5 variables per matrix for visual clarity. Correlations, positive or negative, with an absolute Pearson coefficient above 0.4 were considered significant.

From the above scatter plot matrices, a clear, strong correlation between quality and any single variable is not seen.

+ None of the acidity parameters has a significant correlation with quality, while, as expected, fixed acidity has a negative correlation with pH.
+ By far, the strongest correlations with quality come from alcohol (a positive correlation of 0.436) and density (a negative correlation of -0.307).
+ Correlations between variables were also noticed, notably between density and residual sugar, residual sugar and alcohol, and alcohol and density.

To follow up on the scatter plots and to make it easy to identify correlations, correlations between the variables noted as interesting in the above matrices were calculated using the cor_quality function.

##                          Quality and Alcohol 
##                                    0.4355747 
##                          Quality and Density 
##                                   -0.3071233 
##                        Quality and Chlorides 
##                                   -0.2099344 
##                         pH and Fixed acidity 
##                                   -0.4258583 
##             Density and Total sulfur dioxide 
##                                    0.5298813 
##                   Density and Residual sugar 
##                                    0.8389665 
##                          Density and Alcohol 
##                                   -0.7801376 
##                   Alcohol and Residual sugar 
##                                   -0.4506312 
##                        Alcohol and Chlorides 
##                                   -0.3601887 
## Free Sulfur dioxide and Total sulfur dioxide 
##                                    0.6155010
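The cor_quality function itself is not shown in this report. A minimal sketch of a helper that returns a labelled vector of Pearson correlations in the same shape as the output above could look like this; `cor_pairs`, `toy` and the toy values are all hypothetical.

```r
# Hypothetical helper: 'pairs' is a named list of length-2 character vectors
# giving the two column names to correlate. Returns a named numeric vector.
cor_pairs <- function(data, pairs) {
  vapply(pairs,
         function(p) cor(data[[p[1]]], data[[p[2]]], method = "pearson"),
         numeric(1))
}

# Toy data for illustration only, not values from the wine data set.
toy <- data.frame(
  quality = c(5, 6, 6, 7, 8),
  alcohol = c(9.0, 10.0, 10.5, 11.5, 12.5),
  density = c(0.998, 0.996, 0.995, 0.993, 0.991)
)

pairs <- list(
  "Quality and Alcohol" = c("quality", "alcohol"),
  "Quality and Density" = c("quality", "density")
)

result <- cor_pairs(toy, pairs)
result

# Keep only pairs whose absolute correlation exceeds the 0.4 threshold.
result[abs(result) > 0.4]
```

Taking the label from the list name and the columns from the list value keeps each label next to the pair it describes, which makes mismatches between labels and values easier to spot.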

Boxplots to explore these correlations further.

To explore the correlations identified above, bivariate boxplots were plotted. First, quality was plotted against the 3 variables correlating with it.

Though outliers are evident, trends indicate:

+ Increasing alcohol content with quality
+ Decreasing density with quality
+ Decreasing chloride content with quality. This correlation may be weaker than thought, simply because very low levels of salt are usually present in wines and, at such low levels, differences may be hard to estimate reliably.

Conditional means and summaries for alcohol and density

Alcohol and density had the highest correlations with quality. I then grouped the wines by quality, derived means and medians, and plotted conditional means and summaries for these variables against quality.

Conditional means confirm the trends observed from the box plots and correlations: as quality increases, the alcohol content of white wines is higher and their density is lower.
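The grouping step can be sketched in base R as below. The toy data frame is invented for illustration, and the original analysis may well have used a different package for the grouping.

```r
# Toy data for illustration only, not values from the wine data set.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.0, 10.0, 10.0, 11.0, 11.0, 13.0),
  density = c(0.998, 0.996, 0.996, 0.994, 0.993, 0.991)
)

# Conditional means of alcohol and density for each quality score.
by_quality <- aggregate(cbind(alcohol, density) ~ quality,
                        data = toy, FUN = mean)

# Conditional medians of alcohol for each quality score.
alcohol_median <- tapply(toy$alcohol, toy$quality, median)

by_quality
alcohol_median
```

Plotting `by_quality$alcohol` and `by_quality$density` against `by_quality$quality` gives the conditional mean lines.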

Do fixed acidity and citric acid levels really influence white wine quality?

Acidity is a good indicator of taste. Fixed acidity is a known parameter that is measured in white wines as a marker for consumption. Volatile acidity is a critical marker for taste. Citric acid provides the sparkling taste and crisp finish to white wines. Both fixed acidity and citric acid affect taste. Though these variables show no correlation with quality, I revisited them using box plots with modified y axes, to see whether even a subtle trend is visible in white wines, since quality is very much related to taste perception.

The box plots do not show any obvious trends, as is to be expected from the scatter plot matrices. Fixed acidity correlates with pH, which is measured as one of the attributes for white wine consumption. All white wines in this data set have a pH between 2.72 and 3.82. Given the noisy nature of the scatter plots, the lack of correlation could be hiding a trend in the mean pH. So I chose to see if the line of best fit shows any discernible trend for pH.

Aha! Now I see a subtle relationship between pH and quality!! pH tends to be higher with quality. Though the difference in terms of measurable pH is small, subtle changes in pH may affect taste, as reflected by quality.

Bivariate Analysis

Major correlations observed

Alcohol and density correlate highest with quality of white wines, followed to a lesser extent by chlorides. Other correlations with quality were lower in extent. The conditional means of alcohol over quality show a direct relationship between mean alcohol content and quality. On the other hand, the conditional means of density over quality show an inverse relationship. Acidity is an important measure of taste in white wine. I was intrigued that pH and acidity measures did not correlate with quality. But since acidity is an important factor in the finish of white wines, and fixed acidity is a measure of fermentation of the grapes, I went on to analyse them in relation to quality. No real trends were discernible for the acidity variables. It is possible that the levels are too small for differential analysis. So I looked at a composite indicator of acidity, which in this dataset is pH. It does seem like mean pH is higher, tending towards 3.3, for wines of higher quality. Though it is difficult to really discern a change in pH as small as 0.02 units, acidity may have a complex and subtle effect on taste and hence, quality.

Relationships among other variables

The relationships between variables other than quality threw up some interesting correlations, especially among alcohol, density and residual sugar. There was a strong negative correlation between alcohol content and density of white wines and a weaker negative correlation between alcohol and residual sugar, while density and residual sugar had a strong positive correlation. This is to be expected, since sugar content reflects fermentation to alcohol (final alcohol content) as well as the consistency of the wine (reflected in its density). Total sulfur dioxide also has some correlation with density, and chlorides with alcohol. Also, pH correlates with fixed acidity, as should be expected.

Strongest relationship

Since our major concern here is quality, the strongest relationship of interest was between alcohol content and the quality of white wines. Among the other variables, the strongest relationship was between density and residual sugar.

Multivariate Plots Section

To further explore how quality is affected by variables including alcohol, density and residual sugar, scatter plots were created and faceted by ratings, making it easier to categorize the effects. I use the Rating variable to facet plots, to simplify visualization and associate trends based on quality better.

The above plot shows the 3 most evident variables and their interplay in affecting the quality of white wines. High alcohol content and low density mark high quality white wines. While density is directly proportional to residual sugar content, the higher the quality of white wines, the lower their residual sugar content.

Other variables that could factor along with alcohol and density in white wine quality

The interplay between total sulfur dioxide and density or alcohol, when viewed in relation to quality, is not very informative, except to reiterate a possibly linear relationship between total sulfur dioxide and density across quality. So, the variables that explain most about quality are density and alcohol content of white wines, with residual sugar content, which is a strong indicator of both, serving as the third variable.

Modeling the effects of variables on quality of white wine

A linear model for predicting white wine quality was generated, based on density, alcohol and residual sugar content.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = P4)
## m2: lm(formula = I(quality) ~ I(alcohol) + density, data = P4)
## m3: lm(formula = I(quality) ~ I(alcohol) + density + residual.sugar, 
##     data = P4)
## 
## =====================================================
##                      m1          m2          m3      
## -----------------------------------------------------
##   (Intercept)      2.582***  -22.492***   90.313***  
##                   (0.098)     (6.165)    (12.374)    
##   I(alcohol)       0.313***    0.360***    0.246***  
##                   (0.009)     (0.015)     (0.018)    
##   density                     24.728***  -87.886***  
##                               (6.079)    (12.317)    
##   residual.sugar                           0.053***  
##                                           (0.005)    
## -----------------------------------------------------
##   R-squared            0.2        0.2         0.2    
##   adj. R-squared       0.2        0.2         0.2    
##   sigma                0.8        0.8         0.8    
##   F                 1146.4      583.3       434.1    
##   p                    0.0        0.0         0.0    
##   Log-likelihood   -5839.4    -5831.1     -5776.8    
##   Deviance          3112.3     3101.8      3033.7    
##   AIC              11684.8    11670.3     11563.6    
##   BIC              11704.3    11696.2     11596.1    
##   N                 4898       4898        4898      
## =====================================================
## 
## Calls:
## m4: lm(formula = I(alcohol) ~ I(quality), data = P4)
## m5: lm(formula = I(alcohol) ~ I(quality) + density, data = P4)
## m6: lm(formula = I(alcohol) ~ I(quality) + density + residual.sugar, 
##     data = P4)
## 
## =======================================================
##                      m4          m5           m6       
## -------------------------------------------------------
##   (Intercept)      6.957***   300.640***   531.416***  
##                   (0.106)      (3.652)      (5.821)    
##   I(quality)       0.605***     0.301***     0.145***  
##                   (0.018)      (0.012)      (0.011)    
##   density                    -293.647***  -525.877***  
##                                (3.651)      (5.847)    
##   residual.sugar                             0.153***  
##                                             (0.003)    
## -------------------------------------------------------
##   R-squared            0.2         0.7          0.8    
##   adj. R-squared       0.2         0.7          0.8    
##   sigma                1.1         0.7          0.6    
##   F                 1146.4      4565.8       5108.5    
##   p                    0.0         0.0          0.0    
##   Log-likelihood   -7450.7     -5387.7      -4491.6    
##   Deviance          6009.1      2588.1       1795.0    
##   AIC              14907.3     10783.4       8993.3    
##   BIC              14926.8     10809.4       9025.8    
##   N                 4898        4898         4898      
## =======================================================
##        fit      lwr      upr
## 1 17.40379 16.21019 18.59738

The above models are interesting in that the first set (m1 to m3) has little predictive value for quality: R-squared stays at 0.2 however many variables are added. In contrast, the second set (m4 to m6) shows that alcohol content can be predicted from quality, density and residual sugar, with R-squared reaching 0.8. The model estimate suggests an alcohol content of 17.4%, with lower and upper limits at 16.2% and 18.6%, in keeping with our prediction that high quality wines with low density and residual sugar content will have high alcohol content. Since the highest alcohol content in the data set is 14.2%, this estimate is an extrapolation beyond the observed range.
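The nested fits and the interval estimate can be sketched as follows. Everything here is illustrative: the toy data are simulated with made-up coefficients, the `new_wine` values are hypothetical (the inputs used for the estimate above are not given in this report), and the choice of a prediction interval rather than a confidence interval is an assumption.

```r
set.seed(42)
n <- 200

# Simulated toy data; ranges loosely follow the summaries in this report.
toy <- data.frame(
  quality = sample(3:9, n, replace = TRUE),
  density = runif(n, 0.987, 1.002),
  residual.sugar = runif(n, 0.6, 20)
)

# Made-up generating process, used only so that the models have signal.
toy$alcohol <- 10 + 0.15 * toy$quality - 300 * (toy$density - 0.994) +
  0.1 * toy$residual.sugar + rnorm(n, sd = 0.5)

# Nested linear models, adding one predictor at a time.
m4 <- lm(alcohol ~ quality, data = toy)
m5 <- update(m4, . ~ . + density)
m6 <- update(m5, . ~ . + residual.sugar)

# R-squared for each stage.
r2 <- sapply(list(m4 = m4, m5 = m5, m6 = m6),
             function(m) summary(m)$r.squared)
r2

# Hypothetical high quality wine with low density and low residual sugar.
new_wine <- data.frame(quality = 8, density = 0.990, residual.sugar = 2)

# Point estimate with lower and upper limits (columns fit, lwr, upr).
pred <- predict(m6, newdata = new_wine, interval = "prediction", level = 0.95)
pred
```

Comparing the `new_wine` values with `range()` of each predictor in the fitted data is a quick way to check whether an estimate is an extrapolation.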

Multivariate Analysis

Interesting features observed

The really interesting feature of the plots is the lack of a clear trend to predict quality. Given that quality is a sensory perception, I think this is to be expected.

Model generation

I created 2 models, both linear. One was to predict quality based on alcohol content, followed by density and residual sugar content. No clear predictive value emerged from this: R-squared values were 0.2 for all 3 stages. The lack of strong correlations makes this a weak model. And since quality is a relative perception, I think it is difficult to predict it with objective models. The second model was more of a coin toss: it reversed the parameters of the first to predict alcohol content of white wines based on quality, followed by density and residual sugar content. Though the R-squared value for alcohol content and quality was only 0.2, it jumped dramatically to 0.7 upon addition of density and to 0.8 upon addition of residual sugar. And when used to predict the alcohol content of a high rated wine with low density and residual sugar content, it predicted a high alcohol content, in keeping with our analyses. This was a better model since alcohol is a defined, measurable variable well within the predictive power of an objective model.

Final Plots and Summary

Plot One. Alcohol content and density were the top 2 influencers of white wine quality

Description One

The boxplots of alcohol content and density grouped by quality clearly show the top variables that affect quality of white wines and the direct and inverse trends between quality and alcohol content and density, respectively. They also illustrate the existence of outliers, which result in a lack of strong correlations. Residual sugars affected quality indirectly, by affecting alcohol content and density.

Plot Two. Conditional mean of pH helps assess the importance of pH in white wine quality

Description Two

Acidity seems to have subtle effects on quality. When the conditional mean of pH, a composite indicator of acidity, is plotted against quality, higher quality wines seem to have pH values tending towards 3.3. More acidic wines (lower pH) could harbor sharper tastes, hence lower quality. But, in terms of measurability, discerning a consistent difference in pH between, say, 3.2 and 3.28 is not practical, as it requires extremely sensitive instruments. This could explain the wide range of pH measured and the difference in taste perceived.

Plot Three. What makes a wine high or low quality?

Description Three

To really see how wine quality increases with high alcohol and low density, high and low rated wines were plotted. High rated wines tended to have alcohol content above 10% and density below 0.9925 g/cm^3, while most low rated wines tended in the opposite direction. But to consider this a real trend, it needs to be reflected in the white wines that are rated Average, with quality between 5 and 6, where most of the wines in the dataset fall. The plot of alcohol content against density, on the subset of Average wines colored by quality, clearly shows a clustering of higher quality wines (6) in the high alcohol/low density quadrant of the plot, compared to those of inferior quality (5). This plot illustrates clearly that the discernible influences on quality in this data set are alcohol content and density of white wines.
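The quadrant described above can also be checked numerically, by computing the share of wines in each rating group with alcohol above 10% and density below 0.9925 g/cm^3. The sketch below uses an invented six-row data frame, so the resulting shares say nothing about the real data; only the two thresholds come from the text.

```r
# Toy data for illustration only, not values from the wine data set.
toy <- data.frame(
  rating  = factor(c("Low", "Low", "Low", "High", "High", "High"),
                   levels = c("Low", "High")),
  alcohol = c(9.0, 9.4, 10.6, 11.2, 12.0, 9.8),
  density = c(0.9980, 0.9965, 0.9935, 0.9910, 0.9900, 0.9950)
)

# TRUE for wines in the high alcohol / low density quadrant.
in_quadrant <- toy$alcohol > 10 & toy$density < 0.9925

# Share of wines in the quadrant for each rating group.
share <- tapply(in_quadrant, toy$rating, mean)
share
```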

Reflection

The dataset is tidy and has clearly documented variables which are known parameters considered by experts in deciding the quality of white wines. However, quality is a sensory judgement of individual tasters, hence the lack of a clear correlation between quality and the other objectively measurable variables, as well as the existence of many outliers. These aspects were the challenging attributes of this dataset. However, clearly, high alcohol content is desirable in a white wine, as is low density. Residual sugar affects both and indicates the extent of fermentation the wine has undergone. When these relationships became clear, I decided to focus on how they affect quality, to the exclusion of others, and could successfully delineate a relationship to explain quality and a model based on these variables. Curiously, pH and acidity parameters, which are also factors in fermentation and affect the taste and finish of white wines, showed no correlation with quality. This stumped me at first. But upon closer examination, acidity measurements were low and differentiating them enough to observe a clear correlation was impractical. When I persisted with conditional means, high pH and quality turned out to be related, indicating that the experts preferred less acidic wines, possibly due to the adverse effects of acidity on taste. But this distinction was very subtle and was not visible with conventional plotting. Other factors, beyond taste, could have affected the experts’ perception of quality. Such variables, especially age, which is usually considered an important attribute of wines, may have helped explore quality better and increase predictability. For future analysis, it will be useful to include a more objective measure of white wines, such as price, as a surrogate for quality. This could help us navigate through the outliers better and propose a better model to predict the quality of white wine.